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2^ ■ Abstract 

o . 

' In this paper, we provide an explicit probability distribution for classification purposes. 

It is derived from the Bayesian nonparametric mixture of Dirichlet process model, but 
with suitable modifications which remove unsuitable aspects of the classification based on 
, ^ I this model. The resulting approach then more closely resembles a classical hierarchical 

grouping rule in that it depends on sums of squares of neighboring values. The proposed 
probability model for classification relies on a simulation algorithm which will be based on 
a reversible MCMC algorithm for determining the probabilities, and wc provide numerical 
illustrations comparing with alternative ideas for classification. 

■ Keywords: Classification; MCMC sampling; MDP model. 

^ , 1. Introduction. Suppose we observe data (yi, . . . , ?/„) which are real numbers on (— oo, +oo). 

I The aim is to classify these data into k < n groups and to determine which ones are in the 

same group. This is a clas sic problem and curren t Baye sian approaches rely on mixture 



(N 



I cess model (see, for example, lEscobari ( 19941 )). In the Richardson and Green model the k is 



models, such as described in iRichardson and Green! (jl997l ). or the mixture of Dirichlet pro- 
cess model (see, for exj 
modeled explicitly via 



k 



in ; 

' where the Wk = {wj^k)j=i are weights which sum to one. Prior distributions are assigned 

O ■ to (w k, k) and (^j,(t|)^^ and inference is made possible via reversible jump MCMC, [Green 

( 19951 ). The likelihood function for n observations is given by 

X 



l{k,w,fx,a'^;y,d) oc w^^^fc N(yj; /i^,, a^. 



i=l 

where the {di)^^^ are latent variables which pick out the component, less than or equal to k, 
from which the ith observation is coming from. 

On the other hand, the mixture of Dirichlet process (MDP) model is based on the density 
function 

oo 

where the weights (wj)^-^ sum to one. The parameters {wj, fij, cr|) are assigned distributions 
and, since the classification ideas we have are based on this model, we will elaborate. So, 
the wi = vi and, for j > 1, Wj = VjYli^ji^ — vi) with the (vi) being independent and 
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also independent and identically distributed (the prior) and we consider the prior as 



identically distributed as beta(l,^) random variables for some 9 > Q. The (/ij, Xj = ) are 



7r(/Lt| A) = N(yu; 0, (cA) ^) and 7r(A) = gamma(A; a, 6). 
In this case the corresponding likelihood function is given by 



i=l 

So, in the Richardson and Green model, the k is explicit, but in the MDP model it is implicit, 
and taken to be the number of distinct (di). 

However, we are not convinced that either of these models are useful for classification 
purposes. The key to the problem is that locations of the normal distributions, the (nj), can 
be arbitrarily close to each other and therefore register as different clusters. So two fij close 
to each other register as two clusters with a certain probability on the MDP model, since 
the di liking this location may be all one or the other; but does register as two clusters in 
the Richardson and Green model. Such a scenario may well happen when clusters are not 
normal based, for example. We would also from this point of view expect the MDP model 
to perform slightly better than the Richardson and Green model when using as classification 
modeling. But both methods would over-estimate the number of clusters. We will discuss 
this issue later in Section 4 when we do some numerical illustrations. Nevertheless, the issue 
of overes timation of the number of c luster s for the MDP has already been known; see, for 



example, iMcGrorv and TitteringtonI (j2007l ) 



Our approach is not model based yet the starting point is the MDP model; since we 
believe a classification procedure based on the {di) is preferable. Hence, from the MDP 
model we compute p{d\y), by integrating out the {w, fi, A). But this p{d\y) will include many 
arrangements which are strange for classification purposes. For example, there is positive 
probability on di and dii both being the same j yet yi and y^' can be the largest and smallest 
observation, and observations in between these two extremes are being allocated to different 
groups. So, at this point we simply study p{d\y) as a classification probability model and 
adjust it to eliminate groupings which just don't make any sense. Indeed, it is these such 
types of d which cause the problems with the MDP model as a classifier in the first place. 

So, we first order the y, so that yi is the smallest observation and y„ is the largest 
observation. We then constrain the d so that the {di) are non-decreasing. This ensures that 
any group contains only consecutive y's. For example, group 1 would contain a number of 
the smallest observations; group 2 would contain a number of the next smallest observations; 
while group k would contain a number of the largest observations. Thus, for any trio of 
{Vii < yi2 < 2/43)1 if Uii a-i^d are in the same group, then so is yi^. It follows then that 
our p*(d|y), with the ordered y's, is given by p*{d\y) oc p{d\y) l{di < • • • < dn)- We then 
show how to sample from p*{d\y) in order to compute classifications with high probability, 
and obviously the mode. 

In Section 2 we derive and explain our probability model for classification. Section 3 
then describes a MCMC algorithm for sampling from this probability model; since for large 
n the number of possible clusterings is prohibitively large to compute directly. Section 4 
then presents numerical illustrations based on a toy example of 10 data points whereby all 
probabilities can be computed and the well known and widely studied galaxy data set. 
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2. The classification probability model. Given the outline in the Introduction, our first 
task is to compute p{d\y) based on the MDP model. Now 

n 

p{d,y\fi,X,w) = Y[wd,^(,yi; fJ-d,, 
1=1 

and so 

p{d,y) =E{lYl^,Vd,Ui<dS^-vi)} n°^i/nrf,=, N(yi;/.,A"i)^(d^,dA) 

= nr=i {0 f v-^ {I - vr^-^<^~' dv} {fUd.=j N(y,;^, A-i)vr(d/i,dA)} . 

Here, rij = X^iLi = j) mj = Y17=i ^(^i > i)- '^^^^ ^^^^ term in the product is given 
by 



-■-J: rfi 



j=i r(l + 6* + Tij + 
and the second term is easily found to be given by 



a+nw/2 

b + S]/2j Vc+^r(a) 



where 
and 



= ^ y*- 

Hence, 

This then is the probability of classification based on the MDP model. 

At this point, we simply focus on p{d\y) and assess it as a probability model for classifica- 
tion. So, without loss of generality, we take the y's to be ordered, with yi being the smallest 
observation and yn being the largest. For reasons then given in the Introduction, we would 
now for classification purposes only wish to consider the (dj) to be non-decreasing. Hence, 
we consider p*{d\y) oc l{di < ■ • • < dn)p{d\y). We also impose the constraint that if there 
are k distinct (dj) then dn = k. 

Our observation now is that d is completely determined by {k,ni, . . . ,nk) whereby k is 
the number of distinct (dj) and rij is the number of the di equal to j. Hence, 

' er{i + n,ne + m,) r(a + n,/2)6° 

p[k,m,...,nk)-^ II ^ r ,a+n,/2 

J=i ^ ^ ^' \h + Sy2\ + njT{a) 



3 



where k is the normalizing constant, and now we define mj = n — ni — ■ ■ ■ — rij and 

-2 

and 

n* 

i=n*_i+l 

with n* = ?ii + • • • + nj and ng = 0. Note that, for a given sample size n, the support of 
this probability runs over the set of compositions of the integer n rather than on the number 
of partitions of a set with n elements typically found in the MDP or other exchangeable 
partition probability functions settings encounter in the Bayesian nonparametric literature. 

This probability model for classification is a highly suitable and necessary adaption of 
the probability model for classification based on the MDP model. It can be seen to depend 
fundamentally on the sample variances of the observations in the same group. So the lower 
the sample variances, the higher the probability. The rule of having k = n groups is countered 
by the probability being a product of k terms. The probability depends on the parameters 
{9,a,b,c), which would basically have the same interpretation as if we were using a MDP 
model for the data. So, for example, if 9 is big, which implies a large number of groups in 
the MDP model, its role can be seen explicitly in p{k,ni, . . . ,nk), since we would have the 
term 6^ and so encourages large k. 

We also note that attempts have been made to emphasize the suitable (/c, ni, . . . , rifc) 
by using alternative nonparametric mixing prior distributions to the Dirichlet process, which 
co nstitutes the MDP model, and which put more weight on configurations which are realistic, 
Liioi et d\^2mii ). However, positive mass is still being put on "ridiculous" configurations 



see 



which will lead to overestimation of k. Our approach, in light of this, is remarkably obvious 
in that we put zero weight on all but realistic configurations. 

Here we also mention the problem of what happens if a new piece of data arrives. Our 
approach is not to assume a clustering for the existing data has been set and we decide into 
which group, possibly a new one, the extra piece of data should be put; but rather we merely 
recompute p{k,ni, . . . ^rik) with all the data, including the additional piece. We do not see 
any other approach as being relevant here. 

We will compare our approach with a routine in the package R, a hierarchical clustering 
routine based on local sums of squares, so in principle is not unlike the idea of working with 
sample varia nces. The ro utine is labeled hclust in R and is based on an original algorithm 
appearing in WardI ( 19631 ). 



3. Sampling the model. The basic idea for sampling from p{k, ni, . . . , n^) will be a split- 
merge MCMC algorithm. So at each iteration one of 2 types of move will be proposed: a 
split, whereby a group of size bigger than 1 is divided into 2 groups so k is increased by 1; 
and a merge, whereby 2 groups are combined into 1 group so k is decreased by 1. The idea 
for sampling from p(/c, ni, . . . , n^) can be seen as a reversible jump MCMC algorithm, and for 
ease of exposition we will describe the algorithm using latent variables and the specification 
of a joint density for a configuration conditional on a k: so let rS^'^ for j = 1, ... ,n be a 
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clustering for j groups, and consider 



n k—1 

j=k+i j=i 



which is based on a recent idea described in IWalkeil (|2009l V The concern is that the marginal 



density for {k,ni, . . . , n^) is unchanged; which is the case, as is evidently obvious. Now given 
a k and n^^'^ we propose a move to A; + 1 with probability 1/2 and to A; — 1 with probability 
1/2 (with obvious modifications if fc = 1 or /c = n). We need to therefore sample n^^~^^^ 
imrsi p{n^^^^^\k,n^^^) and n^^~^^ iioTo. p{n^^"'^^\k,n^^^). The former is achieved by finding an 
existing group with size > 1, and then we split this group into 2. If uniform distributions are 
used for both operations then 



where rig'^ is the number of groups of size > 1, from (ni, . . . ,nfc), and ni'^^ is the size of 
this group chosen. The latter is obtained by merging two neighboring groups and so with a 
uniform distribution we have 

p(n('=-^)|A;,n('=)) = 1/(A; - 1). 

Therefore, the sampler carries out each step through a Metropolis-Hastings scheme. When 
at state x^^'^ = (A:, ni, . . . , n^,) the acceptance probability to move to state x^^~^^^ = {k + 
l,ni, . . . ,nfc+i) is 



aix^^Kx^^^""^) = min ■ 



q ^(xC^)) k 



where q = 1/2. On the other hand, if instead two groups, {fisi ^"^^32)^ ™ -"^^^^ selected and 
we attempt to merge them, then the acceptance probability for this move is 



a[x 



W rr(^-i))=min|l ^ ^^^^'''^^ ^-^1 



where rig is the cardinality of the set containing all groups with more than one observation 
in x^^~^\ 

To improve the algorithm we then shuffle the n^^'^ by selecting adjacent groups, (nsi^ns2) 
and attempting to change them into {n*i,n*2) in such a way that both n*^ and n*2 > 1. The 
shuffle is based on the idea of putting the two groups together and then uniformly splitting 
into 2 groups. The acceptance probability is then given by 

a(.,x*)=mm|l,^("*)^"-+"--l) 



p{x) {usi + - 1) 



These acceptance probabilities all follow from the expression p{k, rS-^\ . . . , n^"^) and the can- 
celations which occur when evaluating the ratios of neighboring k. 

The algorithm is effectively a joint Metropolis-Hastings and Gibbs algorithm, rather than 
a reversible jump MCMC algorithm. The dimension is fixed and so no special considerations 
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arise on this issue. The only necessary consideration is that if p(n^'^^ |n'^'^~^)) > then 
^^j^(fc-i)|j^(fc)-j y .yjpg versa. Neither, if one is even needed, have we had to worry 

about a Jacobian, since we are not basing the moves on transformations of variables between 
different dimensions. We believe it is more explicit to understand reversible jump MCMC 
from this perspective. 

Here we consider a more general idea for sampling, based on the notion of a joint density 

p{k, n(^) , . . . , n(")) = p{k, p{n^^\ n^^^-^), n(*^+i) , . . . , n^""^) In^^)). 

Then it is easy to see how the reversible jump MCMC arises from this model. But we can 
seek alternative, and more general strategies, and one such is based on the idea of 

k-l n-l 
j=3 j=k+l 

where p{n^^^\n^''~^^) is the probability density for a split move described earlier, and p{n^'^^) 
is the correct density for n^"^^ given k = 2, and is easy to sample since n*-^-* can be represented 
by a single number between 1 and n — 1. Then it is easy to see that a move from x^^^ to 
x^*^'), with k' £ (fe — 1,/c + 1) can be achieved by first sampling x^'^^^^ from p(n^^^^^\n^''^) 
and x^^~^'^ from the density p{n^'^^) Y]^Z)^p{n^-^'^\n^^~^^)^ which is done by sampling n^^\ then 
n^'^\ and so on, up to n^'^~^\ If p{n^^'^\'nP^~'^^) = then the proposed move is rejected and 
{k,n^^^) is kept. On the other hand, if p{rS-^'^\n'^^~^^) > then a move to A; + 1, proposed 
with probability 1/2 is accepted with probability 

f pik + l,n^^+^^)pin^^^\n^^-^h\ 
min < 1, 



or else a move to fc — 1 is proposed and is accepted with probability 

p{k — l,n^^~^^) p{n^^^\n^^~^^) \ 



min < 1, 



p{k,n^^y) p{n^^ ^)\n^^ 2)) 



While in this particular case we do not claim an improvement using this alternative, the point 
is that there are alternatives to be considered. In this way it can be seen that the reversible 
jump MCMC methodology can be viewed as a special case of a particular idea formulated 
by the notion of a joint density 

p{k, n(^) , . . . , n(")) = p{k, p{n^'^\ n^^^-^), n(*^+i) , . . . , n^""^) jn^'^)). 

To see how the algorithms work here; let us ease the notation by writing Oj. = n^^^ and be 
the {n^^^;j k}. Then, at {k, dk), we sample 9_f^ from the full conditional, but in the original 
algorithm only need 6k-i and 6k+i, and then do a Metropolis-Hastings step for {k,6k) where 
the proposal is to complete the joint density with A: + 1, or k — 1, and the retention of 9^. 

4. Numerical illustrations . In order to underline the kind of results that can be obtained 
by our approach we first consider a small data set; small enough (n = 10) so that we can 
provide exact computations of probabilities for all (fc, ni, . . . , n^). We then illustrate our 
approach with a real data set; the galaxy data set. 
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4.1. Small data set. Suppose the set of ordered observations is y = (-1.522, -1.292, -0.856, 
-0.104, 2.388, 3.080, 3.313, 3.415, 3.922, 4.194), and a histogram of the data is shown in 
Figure [TJ from this it is evident that 2 groups are the most hkely option. Table [T] gives 
the probabilities of having k groups using the MDP approach, and our proposed approach. 
In both cases we use the prior specification of parameters a,s6 = a = h = l and c = 0.1. 
For the MDP model we computed the exact probabilities for each of the 115,975 possible 
partitions of y, and then using them to obtain the exact posterior probabilities for each 
k G {1, . . . , 10}. The highest probability in this case is allocated to A; = 3. Further inspection 
among the partitions indicates that the highest posterior probability of 0.332 corresponds 
to the classification involving 2 groups; {[yi, ?/2i ysi y4]i [2/5, • • • > yio]}) which corresponds to 
(ni,n2) = (4,6). 

On the other hand, the exact probabilities, p*{k), computed over all the 512 possible 
configurations, assign the highest probability to = 2. As with the MDP case, the classi- 
fication with the highest probability corresponds to {[yi, y2i 2/3, 2/4]) [2/5, • • • > Uio]}] but in this 
case with the considerably higher probability of 0.833. Clearly, considering the order of the y 
limits the support considerably, from set partitions to integer compositions, withdrawing all 
inadequate partitions for classification purposes and hence leading to an improved estimator 
for the number of groups. 

The last four columns in Table[T]show the estimates oip*{k) based on the MCMC schemes 
described in Section[3l with 10, 000 and 100, 000 iterations following a burn in period of 1, 000 
and 10, 000 iterations respectively. The results are matching the exact probabilities and from 
these it is evident that both schemes are valid, although the first appears to converge faster 
than the second. 

Our results are in agreement with the hierarchical agglomerative clustering, using Ward's 
(1963) approach, which reduces the number of groups from A; to A; — 1 by minimizing the local 
sum of squares; see Figure [2j It is obvious from this that 2 groups are by far and away the 
preferred choice. 



4.2. Galaxy data set. Here we consider the galaxy data set, studied in lEoedeil (jl99nl V It 



is widely used in the literature to illustrate methodology for mixture modeling. In this case 
the sample size is n = 82 and so we would need to compute 2^^ probabilities to obtain all the 
possible configurations. 

Therefore, we will use the first MCMC algorithm proposed in Section 3 to obtain the 
probabilities. We undertake this approach, with the same parameter specifications as in the 
above small data set example and 10000 iterations after 1000 of burn in period. The MCMC 
estimates result in p*[k = 3) = 0.997 and p*{k = 4) = 0.003, with the highest probability of 
0.677 on the configuration (ni, ?i2, na) = (7,72,3). The same results are attained with the 
second scheme of Section [3] but with a higher number of simulations. 

Depending on the parameter values, e.g. the total mass parameter 9] the Bayesian non- 



param etric mixture model fa vors between 5 and 6 groups; see for example lEscobar and West 



(I1995I) andlLiioi et aLl (120051) . Similar results are achieved in the finite mixture setting as in 
Richardson and Green ( 19971 ). All of these a pproaches seem to be overes t imati ng the number 
of groups, as noted from results reported in lMcGrory and TitteringtonI ( 20071 ). 
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Table 1: Probabilities on the different number of groups for the smah data set example. The 
MDP results correspond to exact posterior probabilities. The probabilities p*{k) and p*{k) 
for the classification model correspond to the exact and MCMC estimates, respectively. The 
columns labeled Ml and M2 refer to the two sampling schemes described in Section [3] with 
(A) 10000 iterations after a 1000 burn in period and (B) 100000 iterations after a 10000 burn 
in period. 
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Figure 1: Histogram for small data set. 



9 



o 



in 



o 



OJ C3^ O 



Figure 2: Dendogram for the small data set example.. 
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